This .Rmd file pulls AirCasting data from the API, then cleans and tidies them.

Mobile sessions

We now have both fixed sessions and mobile sessions available. Ideally, we would use mobile sessions for any analysis and use fixed sessions as “knots” to validate the measurements. However, we do not know of a model that can do that job. For now, I will pull data from the mobile sessions, see what they look like, and remove measurements that are obviously wrong.

I will follow Chris’s method:

Step 1

Here are the usernames that I found on the AirCasting website:
NYCEJA, BLU 12, HabitatMap, BCCHE, scooby, Ana BCCHE, Ricardo Esparza, Tasha kk, lana, Marisa Sobel, Wadesworld18, El Puente 3, El Puente 4, El Puente 2, El Puente 1, mahdin, El Puente 5, Asemple, patqc, sjuma

usernames <- c('NYCEJA', 'BLU%2012', 'HabitatMap', 'BCCHE', 'scooby', 'Ana%20BCCHE', 'Ricardo%20Esparza', 'Tasha%20kk', 'lana', 'Marisa%20Sobel', 'Wadesworld18', 'El%20Puente%201', 'El%20Puente%202', 'El%20Puente%203', 'El%20Puente%204', 'El%20Puente%205', 'mahdin', 'Asemple', 'patqc', 'sjuma')

user_test <- c('NYCEJA', 'HabitatMap', 'BCCHE', 'lana', 'Wadesworld18', 'patqc')
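The percent-encoding in `usernames` can be generated programmatically instead of typed by hand. A minimal sketch using base R's `URLencode` (shown on a subset of the raw names listed above):

```r
# Build the percent-encoded username vector from the raw names,
# rather than hand-encoding each space as %20.
raw_names <- c('NYCEJA', 'BLU 12', 'Ana BCCHE', 'El Puente 1')  # subset for illustration

# URLencode(reserved = TRUE) encodes spaces (and other reserved
# characters) so the names are safe to paste into the API URL
encoded <- vapply(raw_names, URLencode, character(1), reserved = TRUE)
encoded
```

This produces exactly the encoded forms used in `usernames`, e.g. `'BLU%2012'` and `'Ana%20BCCHE'`.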

Write a function that takes one username from the username vector, plugs it into the API call, and extracts the session IDs:

fetch_id <- function(name){
  api_call <- str_c('http://aircasting.org/api/sessions.json?page=0&page_size=500&q[measurements]=true&q[time_from]=0&q[time_to]=2552648500&q[usernames]=', name)
  api_pull <- jsonlite::fromJSON(api_call)
  user_id <- api_pull$streams$'AirBeam2-PM2.5'$id %>% 
    .[!is.na(.)]

  user_id
}

pulled_ids <- map(usernames, fetch_id) %>% 
  unlist()
# This function plugs each stream ID into the measurement API call and pulls data for that ID
pull_fun <- function(id_element){
  test_sess <- str_c("http://aircasting.org/api/realtime/stream_measurements.json/?end_date=2281550369000&start_date=0&stream_ids[]=",id_element) %>% 
    jsonlite::fromJSON(.) %>% 
    mutate(id = id_element) %>% 
    as_tibble()

  test_sess
}

# map() returns a list of tibbles; combine them into a single tibble
airbeam_data <- map(pulled_ids, pull_fun) %>% 
  do.call("bind_rows", .)
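One failed request (a timeout, a deleted session) aborts the whole `map()` loop. A hedged sketch of one way to make the pull tolerant, using `purrr::possibly`; a stand-in fetcher is used here so the example runs without network access:

```r
library(purrr)
library(dplyr)

# Stand-in for pull_fun: fails on even IDs to simulate a bad request
flaky_pull <- function(id_element) {
  if (id_element %% 2 == 0) stop("request failed")
  tibble(value = id_element * 1.5, id = id_element)
}

# possibly() returns `otherwise` instead of erroring, so one bad ID
# does not abort the whole map()
safe_pull <- possibly(flaky_pull, otherwise = NULL)

result <- map(c(1, 2, 3), safe_pull) %>%
  compact() %>%     # drop the NULLs left by failed pulls
  bind_rows()
```

Wrapping `pull_fun` the same way (`possibly(pull_fun, otherwise = NULL)`) would let the real loop skip unreachable streams instead of failing.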


Data cleaning for mobile sessions

We did the following to clean the data:

  1. Separated date and time and created a date variable
  2. Removed values outside the range (0, 1000), then divided the remaining data into two subsets: one with regular values (at most 250) and one with extremely high values (above 250). We analyzed them separately
  3. Removed the latitude/longitude pairs that were not in New York.
airbeam_data_tidy <- airbeam_data %>% 
  separate(time, into = c("year", "month", "day"), sep = "-") %>% 
  separate(day, into = c("day", "time"), sep = "T") %>% 
  separate(time, into = c("hour", "min", "sec"), sep = ":") %>% 
  separate(sec, into = c("sec", "remove"), sep = "Z") %>% 
  select(-remove) %>% 
  mutate(
    date = str_c(year, month, day, sep = '-'),
    date = as.Date(date)
  ) %>% 
  # keep plausible PM2.5 values and coordinates within the New York area
  filter(value > 0 & value < 1000) %>% 
  filter(latitude > 40, longitude > -75, longitude < -70)
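The chain of `separate()` calls above can also be replaced by parsing the ISO-8601 timestamp directly. A sketch assuming the `lubridate` package and that `time` holds strings like `"2019-01-05T12:30:00Z"` (a toy tibble stands in for `airbeam_data`):

```r
library(dplyr)
library(lubridate)

# Toy stand-in for airbeam_data$time
demo <- tibble(time = c("2019-01-05T12:30:00Z", "2018-07-20T08:05:30Z"))

demo_tidy <- demo %>%
  mutate(
    stamp = ymd_hms(time, tz = "UTC"),  # parses the full ISO-8601 string
    date  = as.Date(stamp),             # calendar date
    hour  = hour(stamp),                # hour of day, for diurnal plots
    month = month(stamp)                # month number, for seasonal plots
  )
```

This keeps `date`, `hour`, and `month` as proper date/numeric types instead of character columns.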

Subset 1: regular measurements

airbeam_reg <- airbeam_data_tidy %>% 
  filter(value <= 250)

Subset 2: extremely high values

airbeam_high <- airbeam_data_tidy %>% 
  filter(value > 250)

Our analysis will primarily focus on the regular measurements. However, the extremely high measurements are also useful, because they could help identify potential sources that cause peaks in PM2.5. These two subsets will be analyzed separately.

Visualization

Here are the questions that we want to answer in the visualization:

  1. In general, where is PM2.5 measured in the city? Which areas have more observations and which have fewer?
  2. How do PM2.5 observations change throughout the day?
  3. Are there any monthly or seasonal trends?
  4. Where do the extreme values appear?

Q1: Where is PM2.5 measured?

Here is a map showing the spatial distribution of all measurement locations. Locations outside New York have already been removed.

location_reg <- airbeam_reg %>% 
  group_by(latitude, longitude) %>% 
  summarize(avg_pm = mean(value)) 

leaflet() %>% 
  addTiles() %>% 
  addCircleMarkers(
    data = location_reg,
    lat = ~latitude, lng = ~longitude,
    color = 'green',
    radius = 3
  )
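The markers above are all one color; coloring each point by its average PM2.5 would make the map more informative. A sketch using `leaflet::colorNumeric` (the palette name and domain are illustrative choices, with the domain covering the regular-measurement range):

```r
library(leaflet)

# Map avg_pm values onto a continuous viridis palette;
# the domain spans the regular-measurement range (0, 250]
pal <- colorNumeric(palette = "viridis", domain = c(0, 250))

# pal() returns hex color strings, one per input value
pal(c(5, 100, 250))
```

With `location_reg`, this would be used as `addCircleMarkers(..., color = ~pal(avg_pm))`, optionally followed by `addLegend(pal = pal, values = ~avg_pm)` to show the scale.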

Q2: How do PM2.5 observations change throughout the day?

airbeam_reg %>% 
  mutate(hour = as.numeric(hour)) %>% 
  group_by(hour) %>% 
  summarize(pm_avg = mean(value)) %>% 
  ggplot(aes(x = hour, y = pm_avg)) + geom_line()

Q3: Monthly or seasonal trends?

monthly_averages <- airbeam_reg %>% 
  group_by(month, year) %>% 
  # summarize() gives one row per month-year; mutate() would instead
  # repeat the average on every observation row
  summarize(average = mean(value))

airbeam_reg %>%
  group_by(month, year) %>%
  summarize(average = mean(value), pm_max = max(value), pm_min = min(value), observation_count = n()) %>%
  knitr::kable()
| month | year | average | pm_max | pm_min | observation_count |
|:------|:-----|--------:|-------:|-------:|------------------:|
| 01 | 2019 | 7.461890 | 148 | 1 | 66033 |
| 02 | 2019 | 8.387737 | 232 | 1 | 98456 |
| 03 | 2018 | 3.203922 | 6 | 1 | 510 |
| 03 | 2019 | 4.750883 | 31 | 1 | 1983 |
| 04 | 2019 | 5.328453 | 23 | 1 | 9015 |
| 05 | 2018 | 14.445833 | 21 | 10 | 480 |
| 07 | 2018 | 10.765362 | 215 | 1 | 160149 |
| 08 | 2018 | 16.740977 | 110 | 1 | 46243 |
| 09 | 2018 | 3.031858 | 33 | 1 | 2825 |
| 10 | 2018 | 3.807418 | 142 | 1 | 18335 |
| 11 | 2018 | 5.563186 | 234 | 1 | 19047 |
| 12 | 2018 | 8.987813 | 41 | 1 | 16083 |

In this table, we see that average PM2.5 was highest in the warmer months (May through August 2018).

Q4: Where do the extreme values appear?

location_high <- airbeam_high %>% 
  group_by(latitude, longitude) %>% 
  summarize(avg_pm = mean(value)) 

leaflet() %>% 
  addTiles() %>% 
  addCircleMarkers(
    data = location_high,
    lat = ~latitude, lng = ~longitude,
    color = 'green',
    radius = 3
  )